Tokenizer Adapted for the Nasa Yuwe Language
نویسندگان
چکیده
In Colombia, ethnic and cultural diversity is conceived by the government to be a social right. Such diversity finds expression, among other ways, in a large number of indigenous languages, which have been kept alive for centuries. However, efforts toward conservation and preservation of these languages have generally fallen short. This is the case for the Nasa Yuwe language, spoken by the Nasa, or Páez, indigenous community, the status of which is endangered. Given such a predicament, the use of technology has been found to provide a strategic opportunity for adaptation, ownership, and development of Nasa Yuwe within the social and cultural environment of the Nasa people. The technology includes the use of computational techniques, which allow the exchange of information by means of IR activities. These encourage different, new possibilities for the Nasa people to be able to interact in Nasa Yuwe. It has therefore become necessary to adapt the stages of the IR process to this language. The current paper specifically presents a process for adapting a tokenizer to texts written in Nasa Yuwe. This involves making use of the precision-recall curve as an evaluation and comparison measure. The results presented allow appreciation of all stages in the process of adapting the standard tokenizer to produce the Nasa version, of the Nasa tokenizer and its results over texts written in Nasa Yuwe, and of the analysis of the precision-recall curve baseline in contrast to that of the Nasa tokenizer.
منابع مشابه
Pronunciation learning system for the 32 vowel system of Nasa Yuwe language
Nasa Yuwe is an indigenous language from Colombia (South America), it is, to some extent, an endangered language. Different efforts have been done to revitalize it, the most important of which being the unification of the Nasa Yuwe alphabet. The Nasa Yuwe vowel system has 32 vowels contrasting in nasalization, length, aspiration and glottalization, causing great confusion for the learner. In or...
متن کاملAn experience about developing educational tools for the Nasa Colombian people for revitalizing language and culture
Revitalizing mother tongue and culture is a priority for the Nasa Colombian native people [9, 2]. Since 1970 they are running processes to revert the consequences of the persecution and punishment that the use of Nasa Yuwe, the Nasa language, suffered during earlier periods of their history. This text briefly describes a project that aims to support such revitalization efforts through Human-Com...
متن کاملProducing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations
The main task of the tokenization is to divide the sentences of the text into its constituent units and remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical writing chain that is an independent semantic unit. Tokenization occurs at the word level and the extracted units can be used as input to other components such as stemmer. The requirement to create...
متن کاملExploiting variety-dependent Phones in Portuguese Variety Identification
This paper presents a new approach of building a language identification system using a specialized Phone Recognition system followed by Language Modeling (PRLM) to differentiate Portuguese varieties spoken in African Countries from European Portuguese. The system is designed to focus on exploiting the phonotactic information of a single discriminatively trained tokenizer for the specific pair ...
متن کاملArabic Tokenization System
Tokenization is a necessary and non-trivial step in natural language processing. In the case of Arabic, where a single word can comprise up to four independent tokens, morphological knowledge needs to be incorporated into the tokenizer. In this paper we describe a rule-based tokenizer that handles tokenization as a full-rounded process with a preprocessing stage (white space normalizer), and a ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016